EXECUTIVE SUMMARY



The purpose of the current study was to determine the accuracy of a current 
voice recognition device (VRD) when used by naive speakers versus practiced 
speakers, in a speaker independent mode (one in which the VRD relies
on the speech patterns of individuals other than the current speaker). It 
is conceivable that in future applications of VR technology, it may be 
costly or impractical to provide practice and training to all users. 

The findings suggest that first-time users of VR equipment will obtain
96.85% recognition accuracy, a level at least as high as that obtained by
users who have received training or practiced speaking to the VRD.
Neither nonrecognitions (i.e., errors where the system rejects the input
and responds, in effect, with "I don't understand you, say it again") nor
misrecognitions (i.e., errors where the system accepts the input but
mistakes it for a different input) differed significantly for naive
speakers versus practiced speakers. Furthermore, the misrecognition rate
for naive speakers was only 1.11%.

It was concluded that training and practice may not always be necessary in 
order to obtain optimum performance in the human-VRD system. Without the 
need for practice, which implies modifying the human's behavior, the 
human-machine interaction is more natural, the "friendliness" of the VRD is 
enhanced, and the cost of the VR system use is reduced. 



1. INTRODUCTION 



1.1 Background

In recent years, voice technology has developed to the extent that basic 
systems have now been used successfully in several industrial and military 
applications. With constant improvements being made in the capabilities of 
voice recognition systems, their use in a wider variety of settings is 
already being contemplated. 

As the variety of settings widens, the requirements for the VRD become more 
diversified. One situation may require a VRD to recognize the speech of 
only one user who has thoroughly "trained" the system. Another situation 
might require the VRD to recognize the speech of several users, and, in 
some instances, to recognize the speech of a user for whom the VRD has no 
speech patterns recorded, in effect, a speaker independent situation. In 
the latter cases it would be desirable for the VRD to be capable of 
recognizing the speech of as many users as possible, without an increase in 
errors due to the variance of speech patterns from user to user. 

For purposes of this paper, speaker independence refers to the case in
which a speaker dependent recognizer is used, but the voice patterns of the
user talking to the recognizer are never in its memory. In any
case, decisions must be made concerning the variety of stored speech 
patterns necessary for recognition of a user's speech in particular 
settings. 

1.2 Problem 



In recent experiments, Schwalm and Martin (1982) found that a currently 
available VRD performed with 95% recognition accuracy under speaker 
independent conditions. Their results were based on data from subjects who 
had undergone a training session in which they practiced speaking to the
VRD. This, in turn, could have optimized the VRD's recognition accuracy. 
While 95% recognition accuracy is impressive regardless of the possible 
effects of practice, the contribution that practice makes to recognition 
accuracy deserves investigation. Future applications of VR technology may 
involve users who have never trained a VRD or practiced speaking to one.
In some applications the VRD may be required to interact with a user
population large enough to make training by all users impractical.

The purpose of the present research was to determine the effects, if any, 
of training/practice on recognition accuracy. 

1.3 Objective 



The specific objective of the present research was to assess empirically 
the accuracy with which currently available VRDs could interpret utterances 
made by: (1) speakers who had received practice by training the VRD, and
(2) speakers who had never trained or used a VRD.

2. METHOD



2.1 Subjects 

Thirty volunteers (all males) were recruited from the Naval Postgraduate 
School in Monterey, California. Twenty-seven were students and three were 
staff. None had ever used voice recognition equipment before. 

2.2 Apparatus 

A Threshold Technology model T600 voice recognition device was used in this 
study. The device was capable of storing 256 voice utterances of up to 2 
seconds each. Fifty utterances were used in the present investigation. 
These utterances appear in Appendix A. 

A Shure model SM10 "boom" microphone (mounted on a headset) was used as the 
input device. This microphone is supplied as standard equipment with the 
T600. 

The Threshold system was linked to an IBM computer via a modem, allowing 
the experimenter to manipulate which set of speech patterns the Threshold 
would access when attempting to recognize the 50 utterances. 

2.3 Experimental Design

A 2x3x6 mixed design was employed in this experiment. Experience was a 
two-level between group variable. One group received practice by training 
the VRD (henceforth, "practiced" group) and the other group did not 
(henceforth, "naive" group). Each subject performed six trials, making 
trials the within group variable with six levels. Subjects in each 
experience level were divided into three groups, each of which accessed a 
different set of voice patterns in the VRD, making pattern set the second
between variable with three levels. A pattern set is a group of reference
patterns, called templates, that the VRD refers to in determining what
utterance has been made. These templates are created in the training
phase, as described below. Each pattern set consisted of four templates
for each of the fifty utterances in the vocabulary (4 voices (templates) x
50 utterances = 200 templates per pattern set). In other words, a pattern
set contained the trained templates from four random speakers on the same
utterances listed in Appendix A. The use of three different
pattern sets, each based on four different voices, provided internal
replication of the experience by trials design, and allowed greater
generalization of the results. A summary of the experimental design
appears in Figure 2-1. 
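
As an illustration only, the layout described above can be enumerated in a few lines of Python. The factor levels and counts come from the text; the even split of subjects over the six between-subjects cells is an assumption (the report states only that each experience group was divided into three groups), and all variable names are hypothetical.

```python
from itertools import product

# Factor levels taken from the design description.
EXPERIENCE = ("practiced", "naive")   # between-subjects factor
PATTERN_SETS = (1, 2, 3)              # between-subjects factor
TRIALS = tuple(range(1, 7))           # within-subjects factor

N_SUBJECTS = 30
VOICES_PER_SET = 4       # templates per utterance in one pattern set
VOCABULARY_SIZE = 50     # utterances listed in Appendix A

between_cells = list(product(EXPERIENCE, PATTERN_SETS))

# Assumption: subjects were split evenly over the between-subjects cells.
subjects_per_cell = N_SUBJECTS // len(between_cells)

templates_per_set = VOICES_PER_SET * VOCABULARY_SIZE

# One error rate per subject per trial enters the analysis.
observations = N_SUBJECTS * len(TRIALS)

print(len(between_cells), subjects_per_cell, templates_per_set, observations)
```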

2.4 Procedure 



2.4.1 Training. The term "training," as used in discussions of voice 
recognition studies, refers to the process by which the speaker makes known 
to the recognizer the characteristics of his particular speech patterns for 
all the utterances he will be using. For the T600, this training procedure 
consists of entering 10 passes of each utterance (10x50 or 500 utterances 
per subject) into the voice recognizer. The recognizer automatically 
averages the ten passes of each utterance into a single template, enters 
these templates into its "memory," and matches any subsequent utterances of 
the same vocabulary (in testing) with their templates in memory. Ideally, 
these subsequent utterances are matched with their templates in memory, 
resulting in correct response output on a CRT. In cases where a match is 
not possible a nonrecognition or rejection occurs, signified by a "beep" 
from the recognizer. In effect, the machine is saying "I don't understand 
that utterance--please say it again." Occasionally, however, the 
recognizer makes an incorrect match. In this case, an incorrect response 
is output on the CRT, constituting a "misrecognition." Thus, two types of
errors are possible: nonrecognitions (or rejections) and misrecognitions 
(or misinterpretations) of an utterance. 
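
The distinction between the two error types can be sketched in code. The sketch below is not the T600's algorithm, which this report does not describe beyond the averaging of training passes. It assumes, purely for illustration, that each utterance is a fixed-length list of numbers, that similarity is Euclidean distance, and that a hypothetical rejection threshold (reject_distance) decides when the device "beeps". All function names are invented for the example.

```python
import math

def average_template(passes):
    """Average several equal-length feature vectors into one template."""
    count = len(passes)
    return [sum(values) / count for values in zip(*passes)]

def recognize(features, pattern_set, reject_distance):
    """Return the label of the nearest template, or None for a rejection."""
    best_label, best_distance = None, math.inf
    for label, voice_templates in pattern_set.items():
        for template in voice_templates:
            distance = math.dist(features, template)
            if distance < best_distance:
                best_label, best_distance = label, distance
    # A poor best match is rejected rather than guessed (the "beep").
    return best_label if best_distance <= reject_distance else None

def classify_outcome(spoken, recognized):
    """Label one test utterance as correct, a nonrecognition, or a misrecognition."""
    if recognized is None:
        return "nonrecognition"
    return "correct" if recognized == spoken else "misrecognition"

# Toy two-dimensional features; real templates would be far richer.
pattern_set = {
    "ONE": [average_template([[1.0, 0.0], [0.8, 0.2]])],
    "NINE": [average_template([[0.0, 1.0], [0.2, 0.8]])],
}
outcomes = [
    classify_outcome("ONE", recognize([0.85, 0.15], pattern_set, 0.5)),
    classify_outcome("ONE", recognize([0.15, 0.85], pattern_set, 0.5)),
    classify_outcome("ONE", recognize([5.0, 5.0], pattern_set, 0.5)),
]
print(outcomes)
```

In this toy setup a pattern set maps each utterance to a list of templates, so several voices per utterance (four in the present study) can be stored side by side.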

FIGURE 2-1.

SUMMARY OF EXPERIMENTAL DESIGN

2.4.2 Testing. Each subject was scheduled to make two passes through the
entire vocabulary list on each of three successive days. Subjects in the
practiced group made 2 additional passes through the vocabulary list each
day, providing further practice not received by the naive group. For the 
practiced group, these sessions were administered on Wednesday, Thursday, 
and Friday of the same week in which training took place. Testing sessions 
for the naive group were scheduled on Wednesday, Thursday, and Friday of a 
different week. Thus, a total of six testing trials were run for each 
subject. Both practiced and naive speakers were able to complete the 
experiment within one week. Subjects in the practiced group and the naive 
group never tested against a pattern set containing their own speech 
patterns; thus, both experience groups tested in the speaker independent
mode. 



2.4.3 Summary. Fifteen subjects who had never used VR equipment before
(naive subjects) tested a VRD along with 15 subjects who had trained and
practiced using VR equipment (practiced subjects). Subjects in both groups
tested the device in the speaker independent mode, and both practiced and
naive speakers accessed identical pattern sets. Recognition accuracy was
recorded for 300 critical utterances by each subject. While critical
utterances were the only inputs naive speakers ever made to the VRD, each
practiced speaker had made 1,100 additional inputs to the VRD as a result
of training and practice sessions.

2.5 Independent and Dependent Variables 

The independent variables in this study were pattern set, trials, and 
experience: practiced or naive. The dependent variables were
nonrecognitions (or rejections), misrecognitions, and total errors, which
was the sum of nonrecognitions and misrecognitions.

3. RESULTS



3.1 Overview

This section describes the results of the present study. All repeated 
measures analyses of variance procedures were performed using the arcsin 
transformation of raw data to stabilize the variance of the error terms 
(Neter and Wasserman, 1974). The mean error rates that appear in the 
tables and figures are untransformed. All a posteriori tests for 
significance between pairs of means were performed using the Scheffe 
procedures described in Bruning and Kintz (1977). 
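
The exact form of the arcsin transformation is not stated in this report. A common variance-stabilizing form for proportions is 2 * arcsin(sqrt(p)), and the sketch below assumes that form. The error counts in the loop are made-up illustrations, not data from the study.

```python
import math

def arcsine_transform(errors, attempts):
    """Transform an error proportion p = errors / attempts to 2 * asin(sqrt(p))."""
    p = errors / attempts
    return 2.0 * math.asin(math.sqrt(p))

# Illustrative counts only: errors out of one 50-utterance trial.
for errors in (0, 1, 2, 5):
    print(errors, round(arcsine_transform(errors, 50), 4))
```

The transform stretches differences between proportions near zero, which is where the error rates in this study lie.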

As defined earlier, nonrecognitions and misrecognitions by the voice
recognition system may have distinctly different implications in an applied
setting. In a weapons deployment activity, for example, it would be far
more desirable for the system to respond to an input error by
nonrecognition (a "beep"), where the speaker is told to repeat or correct
the input, than for the system to misinterpret the input and carry out
some incorrect (and perhaps critical) command. Thus, it was
considered essential to determine the effects of the independent variables
on nonrecognitions and misrecognitions separately, as well as on total
number of errors.

Section 3.2 presents the data on total number of errors. Section 3.3 
presents the results of analyses done on nonrecognitions, while Section 3.4 
presents the results of analyses done on misrecognitions. 

3.2 Total Errors 



Table 3-1 presents the analysis of variance for total errors 
(nonrecognitions + misrecognitions). There were no significant effects of 
experience, pattern set, or trials, nor were there any significant 
interactions. Mean total errors for experience by trials are shown in
Table 3-2.

TABLE 3-1

ANALYSIS OF VARIANCE SUMMARY TABLE
FOR TOTAL ERRORS

Source               df       MS         F

Experience (E)        1     .02053      .053
Pattern Set (P)       2     .08908      .231
E x P                 2     .13846
Error                24     .38519

Trials (T)            5     .03760     1.743
T x E                 5     .03193     1.480
T x P                10     .02778     1.288
T x P x E            10     .04021     1.865
Error               120     .02157
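
Each F value in Table 3-1 is the ratio of an effect's mean square to the appropriate error mean square: the between-subjects error term for experience, pattern set, and their interaction, and the within-subjects error term for every effect involving trials. The sketch below recomputes the ratios from the tabled mean squares. The dictionary layout is illustrative only, and small discrepancies in the third decimal place are to be expected from rounding of the printed mean squares.

```python
# Mean squares copied from Table 3-1 (total errors).
MS_BETWEEN_ERROR = 0.38519   # error term for between-subjects effects
MS_WITHIN_ERROR = 0.02157    # error term for effects involving trials

between_ms = {"E": 0.02053, "P": 0.08908, "E x P": 0.13846}
within_ms = {"T": 0.03760, "T x E": 0.03193, "T x P": 0.02778, "T x P x E": 0.04021}

f_ratios = {name: ms / MS_BETWEEN_ERROR for name, ms in between_ms.items()}
f_ratios.update({name: ms / MS_WITHIN_ERROR for name, ms in within_ms.items()})

for name, f in f_ratios.items():
    print(f"{name:10s} F = {f:.3f}")
```

By the same ratio, the experience by pattern set interaction, for which no F value appears in the table, comes to roughly 0.36.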

3.3 Nonrecognitions 

An analysis of variance was performed on the nonrecognitions alone to 
determine the effects, if any, of experience, trials, and pattern sets. 
Table 3-3 presents the analysis of variance summary table for 
nonrecognitions. 

A significant main effect of trials (F=2.36, p<.05) was found, as was a
significant three-way interaction of trials by pattern set by experience
(F=2.219, p<.05). No other main effects or interactions were statistically
significant. Mean nonrecognitions for experience by trials are shown in 
Table 3-4. The main effect of trials, and the three-way interaction of 
trials by pattern set by experience are portrayed graphically in Figures 
3-1 and 3-2, respectively. 

With regard to the main effect of trials, although the analysis of variance 
indicated a significant trials effect, review of Figure 3-1 reveals no 
apparent systematic change over trials. A Scheffe test for significance 
between pairs of means detected no significant differences between any two
trials. Evidently, the analysis of variance is sensitive to the spurious 
nature of errors across trials. However, the difference between even the 
highest and lowest error rates over trials is not large enough to reach 
statistical significance in the post hoc Scheffe test. For further 
discussion on post hoc range tests, and lack of significance in post hoc 
tests where significance was reached in an analysis of variance, see J.L. 
Myers, 1972. 

TABLE 3-2

MEAN TOTAL ERRORS (IN PERCENT)
FOR EXPERIENCE BY TRIALS

                                   TRIALS
EXPERIENCE         1      2      3      4      5      6     Mean over trials

PRACTICED         5.20   3.60   5.60   5.33   4.27   5.20        4.87
NAIVE             4.00   3.60   2.67   2.80   2.80   3.07        3.15

Mean over
experience        4.60   3.60   4.14   4.07   3.53   4.1         4.01 (grand mean)

TABLE 3-3

ANALYSIS OF VARIANCE SUMMARY
TABLE FOR NONRECOGNITIONS

Source               df       MS         F

Experience (E)        1     .05712      .158
Pattern Set (P)       2     .02264      .063
E x P                 2     .05488      .152
Error                24     .36168

Trials (T)            5     .04666     2.356*
T x E                 5     .03194     1.613
T x P                10     .03147     1.589
T x P x E            10     .04395     2.219*
Error               120     .01980

*p < .05

TABLE 3-4

MEAN NONRECOGNITIONS (IN PERCENT)
FOR EXPERIENCE BY TRIALS

                                   TRIALS
EXPERIENCE         1      2      3      4      5      6     Mean over trials

PRACTICED         3.60   2.27   3.73   4.13   3.47   4.13        3.56
NAIVE             3.47   2.13   1.60   1.47   1.60   2.00        2.04

Mean over
experience        3.53   2.20   2.67   2.80   2.53   3.07        2.80 (grand mean)

FIGURE 3-1

NONRECOGNITIONS BY TRIALS

FIGURE 3-2

NONRECOGNITIONS FOR EXPERIENCE 
BY TRIALS BY PATTERN SET 



The experience by trials by pattern set interaction also reached 
significance in the analysis of variance. Again, there were no 
interpretable or systematic effects, and the authors attach no practical 
significance to either the trials or the experience by trials by pattern 
set interaction. 

3.4 Misrecognitions 

As for nonrecognitions, an analysis of variance was performed on the 
misrecognitions alone to determine the effects, if any, of experience, 
pattern sets, and trials. Table 3-5 presents the analysis of variance 
summary table for misrecognitions. 

A significant main effect of pattern sets (F=6.02, p<.01) is evident. The 
main effects of experience and trials were not significant, nor were any of 
the interactions. Mean misrecognitions for experience by pattern set are 
shown in Table 3-6, and the effect of pattern sets is portrayed graphically 
in Figure 3-3. 

With regard to the main effect of pattern sets, a Scheffe test for
significance between pairs of means was performed to determine where such
differences lie. Again, as was the case for the trials effect on
nonrecognitions, the main effect of pattern sets on misrecognitions,
reported in the analysis of variance, could not be detected in the Scheffe
test. (Review Figure 3-3 for further clarification.) Misrecognitions do
vary somewhat as a function of pattern set. However, the greatest number of
errors (pattern set 1) was 2.23%, leaving little range for variability with
a floor of zero. With the stringent per comparison alpha level imposed by
the Scheffe test, the difference between pattern set one and pattern set
three (where the fewest errors occurred) did not reach significance. All
statistical results considered, the effect of pattern sets may be
attributed to greater dissimilarity between the voices of subjects and
contributors of pattern set one than between voices of subjects and
contributors of pattern sets 2 and 3.

TABLE 3-5

ANALYSIS OF VARIANCE SUMMARY
TABLE FOR MISRECOGNITIONS

Source               df       MS         F

Experience (E)        1     .00000     0
Pattern Set (P)       2     .39584     6.02*
E x P                 2     .08367     1.272
Error                24     .06575

Trials (T)            5     .01504      .728
T x E                 5     .03154     1.525
T x P                10     .02492     1.205
T x P x E            10     .01496      .724
Error               120     .02067

*p < .01

TABLE 3-6

MEAN MISRECOGNITIONS (IN PERCENT)
FOR EXPERIENCE BY PATTERN SET

                         PATTERN SET
EXPERIENCE         1       2       3      Mean over pattern sets

PRACTICED         2.93     .53     .47         1.31
NAIVE             1.53    1.13     .67         1.11

Mean over
experience        2.23     .83     .57         1.21 (grand mean)

FIGURE 3-3

MISRECOGNITIONS BY PATTERN SETS

4. DISCUSSION



The following section discusses some implications of the aforementioned 
results. 

4.1 Total Errors

There were no significant differences in the number of total errors 
produced by practiced speakers versus naive speakers. In positive terms,
naive speakers obtained recognition accuracy of 96.85%, with the VRD
relying on the speech patterns of four independent speakers. This
performance represents a slight (1.72%) but statistically non-significant
improvement over practiced speakers, and lends further support to previous
findings of greater than 95% recognition accuracy in the speaker
independent mode in general (Schwalm & Martin, 1982).
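
The accuracy figures quoted here follow directly from the mean total error rates over trials in Table 3-2, as the short calculation below shows. The variable names are illustrative.

```python
# Mean total error rates over trials, in percent (Table 3-2).
practiced_error = 4.87
naive_error = 3.15

# Recognition accuracy is the complement of the total error rate.
naive_accuracy = 100.0 - naive_error
practiced_accuracy = 100.0 - practiced_error

# Advantage of naive over practiced speakers, in percentage points.
difference = practiced_error - naive_error

print(f"{naive_accuracy:.2f} {practiced_accuracy:.2f} {difference:.2f}")
```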

4.2 Nonrecognitions 

Nonrecognitions accounted for 70% of the total errors. As was the case 
with total errors, there were slightly fewer (1.52%) nonrecognitions
produced by naive speakers; however, this difference was non-significant.

4.3 Misrecognitions

As was the case with total errors and nonrecognitions, naive speakers
produced slightly fewer misrecognitions (.2%) than practiced speakers;
again, the difference was non-significant. Misrecognitions accounted for
only 30% of the total errors, a fortunate finding since misrecognitions are
the more problematic of the two types of errors, as explained earlier.
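
The 70% and 30% shares can be checked against the grand means reported in Tables 3-2, 3-4, and 3-6, and the same tables give the naive versus practiced differences cited in Sections 4.2 and 4.3. The variable names below are illustrative.

```python
# Grand mean error rates, in percent.
grand_total = 4.01            # Table 3-2
grand_nonrecognition = 2.80   # Table 3-4
grand_misrecognition = 1.21   # Table 3-6

# Share of total errors contributed by each error type.
nonrecognition_share = 100.0 * grand_nonrecognition / grand_total
misrecognition_share = 100.0 * grand_misrecognition / grand_total

# Practiced minus naive, from the row means of Tables 3-4 and 3-6.
nonrecognition_gap = 3.56 - 2.04
misrecognition_gap = 1.31 - 1.11

print(f"{nonrecognition_share:.1f} {misrecognition_share:.1f}")
print(f"{nonrecognition_gap:.2f} {misrecognition_gap:.2f}")
```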






The question arises as to why, even though not statistically significant,
naive speakers seem to make fewer errors than practiced speakers.

An explanation for the apparently better performance of naive subjects as 
opposed to practiced subjects may be linked to the effects of stress on 
voice recognition performance. In a previous study (Schwalm, 1983), it was 
found that speakers' attitudes about their performance in the initial 
stages of using voice recognition technology appeared to contribute to 
their subsequent performance. It is entirely possible that subjects who 
had used voice recognition equipment before felt that they should be able 
to use that equipment with a high level of proficiency (even though there 
may be no real objective reasons to expect this). If subjects really felt 
that this should be the case, they may have entered the experiment with 
some self-imposed expectations of achieving a high level of performance 
during the experiment. It is therefore possible that when the subjects 
made their first few errors, they became frustrated (or stressed, in the 
general sense) and that the quality of their subsequent inputs was degraded 
(see Schwalm, 1983). Thus, poorer performance for the practiced group 
might be expected. 

It is important to note that the above explanation based on self-imposed 
(psychological) stress is speculative at this point. The authors feel that 
the entire area of psychological (as well as other sources of) stress, as 
it applies to performance with voice recognition technology, deserves 
considerable research attention in the future. If individuals will be 
required to use voice recognition equipment in a growing number of 
applications, and if (as it appears at this time) stress changes the
quality of voice input, there is significant value in determining just how 
stress affects the users of voice recognition equipment and their 
performance. 

5. CONCLUSION



The present research has shown that a person who has never trained or 
practiced speaking to a VRD can obtain 96.85% recognition accuracy with the 
VRD relying on the speech patterns of four independent speakers. This 
degree of accuracy does not differ significantly from that of speakers who
did train the VRD and practice speaking the vocabulary. In the speaker
independent mode, training is not associated with any significant cost or
benefit in recognition accuracy. In other words, training and practice may not
be necessary, a situation favorable to the potential applications of VR 
technology. 

Some human-machine systems involve very high "friendliness" demands. In 
some applications, the need for all users to train or practice speaking to 
the VRD represents an acceptable cost. However, in other applications 
(with large or unspecified populations) the need for all users to train and 
practice speaking to the VRD could be so impractical that it would 
eliminate voice as a method of input. The current findings suggest that 
voice is a viable method of input, not requiring training and practice for 
successful operation. 

The reader is reminded of some pertinent qualifications to these findings.
All subjects were male, native English speakers from the Naval Postgraduate
School, ranging from about 25 to 35 years of age. The three pattern sets
that the subjects tested against were created by subjects who met these
same criteria. Under a conservative interpretation, the 96% average
recognition rate might decrease in a real world situation involving a more
diversified user population. However, if the pattern sets were constructed
selectively, rather than by random assignment, the 96% recognition rate
might logically be expected to increase. Future research at the Naval
Postgraduate School will investigate spectrographic speech characteristics
in an effort to qualify and optimize the speech patterns stored in the
VRD's memory. All things considered, the authors are confident that the 
current findings reflect the capability of state of the art VRDs to 
interact successfully with untrained, unpracticed users such as those who 
participated in the present investigation. 

APPENDIX A

UTTERANCE              WORD #    UTTERANCE

ONE                      26      SIERRA
YANKEE                   27      APPLICATION
GARY POOCK               28      HUMAN FACTORS
CARRIAGE RETURN          29      CENTRAL EXPRESSWAY
IRAN                     30      FILE TRANSFER PROTOCOL
SWEDEN                   31      NINE
LOGIN POOCK              32      INDIA
ACCAT TITLE              33      LIMA
LOAD GLD3                34      POPPA
POOCK NPS PASSWORD       35      UNIFORM
THREE                    36      KOREA
LOGOUT                   37